{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<link href='https://fonts.googleapis.com/css?family=Passion+One' rel='stylesheet' type='text/css'><style>div.attn { font-family: 'Helvetica Neue'; font-size: 30px; line-height: 40px; color: #FFFFFF; text-align: center; margin: 30px 0; border-width: 10px 0; border-style: solid; border-color: #5AAAAA; padding: 30px 0; background-color: #DDDDFF; }hr { border: 0; background-color: #ffffff; border-top: 1px solid black; }hr.major { border-top: 10px solid #5AAA5A; }hr.minor { border: none; background-color: #ffffff; border-top: 5px dotted #CC3333; }div.bubble { width: 65%; padding: 20px; background: #DDDDDD; border-radius: 15px; margin: 0 auto; font-style: italic; color: #f00; }em { color: #AAA; }div.c1{visibility:hidden;margin:0;height:0;}div.note{color:red;}</style>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#  Enable HTML/CSS \n",
    "from IPython.core.display import HTML\n",
    "HTML(\"<link href='https://fonts.googleapis.com/css?family=Passion+One' rel='stylesheet' type='text/css'><style>div.attn { font-family: 'Helvetica Neue'; font-size: 30px; line-height: 40px; color: #FFFFFF; text-align: center; margin: 30px 0; border-width: 10px 0; border-style: solid; border-color: #5AAAAA; padding: 30px 0; background-color: #DDDDFF; }hr { border: 0; background-color: #ffffff; border-top: 1px solid black; }hr.major { border-top: 10px solid #5AAA5A; }hr.minor { border: none; background-color: #ffffff; border-top: 5px dotted #CC3333; }div.bubble { width: 65%; padding: 20px; background: #DDDDDD; border-radius: 15px; margin: 0 auto; font-style: italic; color: #f00; }em { color: #AAA; }div.c1{visibility:hidden;margin:0;height:0;}div.note{color:red;}</style>\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "___\n",
    "Enter Team Member Names here (double click to edit):\n",
    "\n",
    "- Name 1:\n",
    "- Name 2:\n",
    "- Name 3:\n",
    "\n",
    "________\n",
    "\n",
    "# In Class Assignment Three\n",
    "In the following assignment you will be asked to fill in python code and derivations for a number of different problems. Please read all instructions carefully and turn in the rendered notebook (or HTML of the rendered notebook)  before the end of class.\n",
    "\n",
    "<a id=\"top\"></a>\n",
    "## Contents\n",
    "* <a href=\"#Loading\">Loading the Data</a>\n",
    "* <a href=\"#distance\">Measuring Distances</a>\n",
    "\n",
    "** Available after class begins: **\n",
    "* <a href=\"#KNN\">K-Nearest Neighbors</a>\n",
    "* <a href=\"#naive\">Naive Bayes</a>\n",
    "\n",
    "________________________________________________________________________________________________________\n",
    "<a id=\"Loading\"></a>\n",
    "<a href=\"#top\">Back to Top</a>\n",
    "## Downloading the Document Data\n",
    "Please run the following code to read in the \"20 newsgroups\" dataset from sklearn's data loading module."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "features shape: (11314, 130107)\n",
      "target shape: (11314,)\n",
      "range of target: 0 19\n",
      "Data type is <class 'scipy.sparse.csr.csr_matrix'> 0.1214353154362896 % of the data is non-zero\n"
     ]
    }
   ],
   "source": [
    "from sklearn.datasets import fetch_20newsgroups_vectorized\n",
    "import numpy as np\n",
    "from __future__ import print_function\n",
    "\n",
    "# this takes about 30 seconds to compute, read the next section while this downloads\n",
    "ds = fetch_20newsgroups_vectorized(subset='train')\n",
    "\n",
    "# this holds the continuous feature data (which is tfidf)\n",
    "print('features shape:', ds.data.shape) # there are ~11000 instances and ~130k features per instance\n",
    "print('target shape:', ds.target.shape) \n",
    "print('range of target:', np.min(ds.target),np.max(ds.target))\n",
    "print('Data type is', type(ds.data), float(ds.data.nnz)/(ds.data.shape[0]*ds.data.shape[1])*100, '% of the data is non-zero')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Understanding the Dataset\n",
    "Look at the description for the 20 newsgroups dataset at http://qwone.com/~jason/20Newsgroups/. You have just downloaded the \"vectorized\" version of the dataset, which means all the words inside the articles have gone through a transformation that binned them into 130 thousand features related to the words in them.  \n",
    "\n",
    "**Question Set 1**:\n",
    "- How many instances are in the dataset? \n",
    "- What does each instance represent? \n",
    "- How many classes are in the dataset and what does each class represent?\n",
    "- Would you expect a classifier trained on this data would generalize to documents written in the past week? Why or why not?\n",
    "- Is the data represented as a sparse or dense matrix?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "___\n",
    "Enter your answer here:\n",
    "\n",
    "*Double click to edit*\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "___\n",
    "<a id=\"distance\"></a>\n",
    "<a href=\"#top\">Back to Top</a>\n",
    "## Measures of Distance\n",
    "In the following block of code, we isolate three instances from the dataset. The instance \"`a`\" is from the group *computer graphics*, \"`b`\" is from from the group *recreation autos*, and \"`c`\" is from group *recreation motorcycle*. **Exercise for part 2**: Calculate the: \n",
    "- (1) Euclidean distance\n",
    "- (2) Cosine distance \n",
    "- (3) Jaccard similarity \n",
    "\n",
    "between each pair of instances using the imported functions below. Remember that the Jaccard similarity is only for binary valued vectors, so convert vectors to binary using a threshold. \n",
    "\n",
    "**Question for part 2**: Which distance seems more appropriate to use for this data? **Why**?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Instance A is from class comp.graphics\n",
      "Instance B is from class rec.autos\n",
      "Instance C is from class rec.motorcycles\n",
      "\n",
      "\n",
      "Euclidean Distance\n",
      " ab: Placeholder ac: Placeholder bc: Placeholder\n",
      "Cosine Distance\n",
      " ab: Placeholder ac: Placeholder bc: Placeholder\n",
      "Jaccard Dissimilarity (vectors should be boolean values)\n",
      " ab: Placeholder ac: Placeholder bc: Placeholder\n",
      "\n",
      "\n",
      "The most appropriate distance is...\n",
      "Placeholder\n"
     ]
    }
   ],
   "source": [
    "from scipy.spatial.distance import cosine\n",
    "from scipy.spatial.distance import euclidean\n",
    "from scipy.spatial.distance import jaccard\n",
    "import numpy as np\n",
    "\n",
    "# get first instance (comp)\n",
    "idx = 550\n",
    "a = ds.data[idx].todense()\n",
    "a_class = ds.target_names[ds.target[idx]]\n",
    "print('Instance A is from class', a_class)\n",
    "\n",
    "# get second instance (autos)\n",
    "idx = 4000\n",
    "b = ds.data[idx].todense()\n",
    "b_class = ds.target_names[ds.target[idx]]\n",
    "print('Instance B is from class', b_class)\n",
    "\n",
    "# get third instance (motorcycle)\n",
    "idx = 7000\n",
    "c = ds.data[idx].todense()\n",
    "c_class = ds.target_names[ds.target[idx]]\n",
    "print('Instance C is from class', c_class)\n",
    "\n",
    "# Enter distance comparison below for each pair of vectors:\n",
    "p = 'Placeholder'\n",
    "print('\\n\\nEuclidean Distance\\n ab:', p, 'ac:', p, 'bc:',p)\n",
    "print('Cosine Distance\\n ab:', p, 'ac:', p, 'bc:', p)\n",
    "print('Jaccard Dissimilarity (vectors should be boolean values)\\n ab:', p, 'ac:', p, 'bc:', p)\n",
    "\n",
    "print('\\n\\nThe most appropriate distance is...')\n",
    "print(p)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [conda env:MLEnv]",
   "language": "python",
   "name": "conda-env-MLEnv-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
